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Abstract In this paper, we introduce Continuation Passing C (CPC), a programming lan- 
guage for concurrent systems in which native and cooperative threads are unified and pre- 
sented to the programmer as a single abstraction. The CPC compiler uses a compilation 
technique, based on the CPS transform, that yields efficient code and an extremely light- 
weight representation for contexts. We provide a proof of the correctness of our compilation 
scheme. We show in particular that lambda-lifting, a common compilation technique for 
functional languages, is also correct in an imperative language like C, under some condi- 
tions enforced by the CPC compiler. The current CPC compiler is mature enough to write 
substantial programs such as Hekate, a highly concurrent BitTorrent seeder. Our benchmark 
results show that CPC is as efficient, while using significantly less space, as the most effi- 
cient thread libraries available. 
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1 Introduction 



Most computer programs are concurrent programs, which need to perform multiple tasks 
at a given time. For example, a network server needs to serve multiple clients at a time; 
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a program with a graphical user interface needs to expect keystrokes and mouse clicks at 
multiple places; and a network program with a graphical interface (e.g. a web browser) 
needs to do both. 

The dominant abstraction for concurrency is provided by threads, or lightweight pro- 
cesses. When programming with threads, a process is structured as a dynamically changing 
number of independently executing threads that share a single heap (and possibly other data, 
such as global variables). The alternative to threads is event-loop or event-driven program- 
ming. An event-driven program interacts with its environment by reacting to a set of stimuli 
called events. At any given point in time, to every event is associated a piece of code known 
as the handler for this event. A global scheduler, known as the event loop, repeatedly waits 
for an event to occur and invokes the associated handler. 

Unlike threads, event handlers do not have an associated stack; event-driven programs 
are therefore more lightweight and often faster than their threaded counterparts. However, 
because it splits the flow of control into multiple tiny event-handlers, event-driven program- 
ming is difficult and error-prone. Additionally, event-driven programming alone is often not 
powerful enough, in particular when accessing blocking APIs or using multiple processor 
cores; it is then necessary to write hybrid code, that uses both threads and event handlers, 
which is even more difficult. 

Since event-driven programming is more difficult but more efficient than threaded pro- 
gramming, it is natural to want to at least partially automate it. 

Continuation Passing C Continuation Passing C (CPC) is an extension of the C program- 
ming language with concurrency primitives which is implemented by a series of source-to- 
source transformations, including a phase of lambda-lifting (Section [575b and a transforma- 
tion into Continuation-Passing Style (Section [5.2b . After this series of transformations, the 
CPC translator yields a program in hybrid style, most of it event-driven, but using the native 
threads of the underlying operating system wherever this has been specified by the CPC 
programmer, for example in order to call blocking APIs or distribute CPU-bound code over 
multiple cores or processors. 

Although it is implemented as a series of well-understood source-to-source translations, 
CPC yields code that is just as fast as the fastest thread libraries available to us; we fur- 
ther believe that the code it generates is similar to carefully tuned code written by a human 
programmer with a lot of experience with event-driven code. And although it is an exten- 
sion of the C programming language, a language notoriously hostile to formal proofs, the 
correctness of the transformations used by CPC have been proved; while we perform these 
proofs in a simplified dialect, there is nothing in principle that prevents a complete proof of 
the correctness of CPC. We believe that this property makes CPC unique among the sys- 
tems for efficient implementation of concurrency in a widely used imperative programming 
language. 

In this paper, we give an introduction to CPC programming, we describe informally 
the transformations performed by CPC, and give an outline of the proof of correctness; the 
complete proof can be found in a companion technical report |28|. 

1.1 Outline 

The rest of this paper is organised as follows. In Section|2] we present previous work related 
to threads and event-driven programming, and to the compilation techniques used by the 
CPC translator. In Section [3] we give an overview of the CPC language through excerpts 
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of code from Hekate, the most significant program written in CPC so far. In Section [4j we 
show that CPC threads are as fast as the most efficient thread libraries available to us. In 
Section[5j we detail the several passes of the CPC compilation technique to translate threads 
into events, and in Section|6]we prove the correctness of one of these passes, lambda-lifting, 
in the context of an imperative, call-by-value language restricted to functions called in tail 
position. We conclude in Section[7] 



2 Related work 

2.1 Continuations and concurrency 

Continuations offer a natural framework to implement concurrency systems in functional 
languages: they capture "the rest of the computation" much like the state of an imperative 
program is captured by the call stack. Thread-like primitives may be built with either first- 
class continuations, or encapsulated within a continuation monad. 

The former approach is best illustrated by Concurrent ML constructs 1381 , implemented 
on top of SML/NJ's first-class continuations, or by the way coroutines and engines are typ- 
ically implemented in Scheme using the call/cc operator [22,15] (previously known as 
catch (52)). Stackless Python [49] uses first-class continuations to implement generators, 
which are in turn used to implement concurrency primitives. Scala also uses first-class con- 
tinuations, through the shift and reset operators, to implement concurrency primitives 
and asynchronous 10 [40|. 

Explicit translation into continuation-passing style, often encapsulated within a monad, 
is used in languages lacking first-class continuations. In Haskell, the original idea of a con- 
currency monad is due to Scholz 1411 . and extended by Claessen |9] to a monad transformer 
yielding a concurrent version of existing monads. Li et. al 1331 use a continuation monad to 
lazily translate the thread abstraction exposed to the programmer into events scheduled by 
an event loop. In Objective Caml, Vouillon's Iwt [501 provides a lightweight alternative to 
native threads, with the ability to execute blocking tasks in a separate thread pool. 



2.2 Transformation techniques 

The three main techniques used by CPC — CPS conversion, lambda-lifting and splitting 
— are fairly standard techniques for compiling functional languages, or at least languages 
providing inner functions. 

The conversion into continuation-passing style, or CPS conversion, has been discovered 
and studied many times in different contexts [37,46,39]. It is used for instance to compile 
Scheme (Rabbit |44)) and ML (SML/NJ O), both of them exposing continuations to the 
programmer. 

Lambda-lifting, also called closure-conversion, is a standard technique to remove free 
variables. It is introduced by Johnsson 1251 and made more efficient by Danvy and Schultz 
1111 . Fischbach and Hannan prove its correctness for the Scheme language! 17]. Although 
environments are a more common way to handle free variables, some implementations use 
lambda-lifting; for instance, the Twobit Scheme-to-C compiler 1 10]. 

We call splitting the conversion of a complex flow of control into mutually recursive 
function calls. Van Wijngaarden is the first to describe such a transformation, in a source- 
to-source translation for Algol 60 1 54]. The idea is then used by Landin to formalise a 
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translation between Algol and the lambda-calculus 1 3 1 ], and by Steele and Sussman to ex- 
press gotos in applicative languages such as LISP or Scheme 1441 . Thielecke adapts van 
Wijngaarden's transformation to the C language, albeit in a restrictive way 1481 . 

We are aware of two implementations of splitting: Weave and TameJS. Weave is an 
unpublished tool used at IBM around year 2000 to write firmware and drivers for SSA-SCSI 
RAID storage adapters 1291 . It translates annotated Woven-C code, written in threaded style, 
into C code hooked into the underlying event-driven kernel. TamejQ by the authors of the 
C++ Tame library 1301 . is a similar tool for Javascript where event-driven programming is 
made mandatory by the lack of concurrency primitives 

2.3 Threads and events 

There has been a lot of research to provide efficient threads, as well as to make event-driven 
programming easier [4 53,36 1. We focus here on results involving a transformation between 
threads and events, or building bridges between them. 

Adya et al. [Q present adaptors between event-driven and threaded code to write pro- 
grams mixing both styles. They first introduce the notion of stack ripping to describe the 
manual stack management required by event-driven style. 

Duff introduces a technique, known as Duff's device 1131 , to express general loop un- 
rolling directly in C, using the switch statement. Since then, this technique has been em- 
ployed multiple times to express state machines and event-driven programs in a threaded 
style0 For instance, it is used by Tatham to implement coroutines in C 1471 . Other C li- 
braries later expanded this idea, such as protothreads 1 14] and FairThreads' automata (6). 
These libraries help keep a clearer flow of control but they provide no automatic handling 
of local variables: the programmer is expected to save them manually in his own data struc- 
tures, just like in event-driven style. 

Tame [ 30] is a C++ language extension and library which exposes events to the program- 
mer while introducing state machines to avoid the stack ripping issue and retain a thread-like 
feeling. The programmer needs to annotate local variables that must be saved across context 
switches. 

TaskJava [18] implements the same idea than Tame, in Java, but preserves local variables 
automatically, storing them in a state record. Kilim |43 ] is a message-passing framework for 
Java providing actor-based, lightweight threads. It is also implemented by a partial CPS 
conversion performed on annotated functions, but contrary to TaskJava, it works at the JVM 
bytecode level. 

AC is a set of language constructs for composable asynchronous IO in C and C++ 1211 . 
Harris et al. introduce do. .finish and async operators to write asynchronous requests in 
a synchronous style, and give an operational semantics. The language constructs are some- 
what similar to those of Tame but the implementation is widely different, using LLVM code 
blocks or macros based on GCC nested functions rather than source-source transformations. 

Haller and Odersky [ 20 1 advocate unification of thread-based and event-based models 
through actors, with the react and receive operators provided by the Scala Actors library; 
suspended actors are represented by continuations. 

1 http://tamejs.org/ 

2 Note that, contrary to TameJS, the original Tame implementation in C++ does not use splitting but a state 
machine using switches. 

3 This abuse was already envisioned by Duff in 1983: "I have another revolting way to use switches to 
implement interrupt driven state machines but it's too horrid to go into." 1131 
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3 The CPC language 

Together with two undergraduate students, we taught ourselves how to program in CPC 
by writing Hekate, a BitTorrent seeder, a massively concurrent network server designed 
to efficiently handle tens of thousands of simultaneously connected peers 127 O il. In this 
section, we give an overview of the CPC language through several programming idioms that 
we discovered while writing Hekate. 

3.1 Cooperative CPC threads 

The extremely lightweight, cooperative threads of CPC lead to a "threads are everywhere" 
feeling that encourages a somewhat unusual programming style. 

Lightweight threads Contrary to the common model of using one thread per client, Hekate 
spawns at least three threads for every connecting peer: a reader, a writer, and a timeout 
thread. Spawning several CPC threads per client is not an issue, especially when only a few 
of them are active at any time, because idle CPC threads carry virtually no overhead. 

The first thread reads incoming requests and manages the state of the client. The Bit- 
Torrent protocol defines two states for interested peers: "unchoked," i.e. currently served, 
and "choked." Hekate maintains 90 % of its peers in choked state, and unchokes them in a 
round-robin fashion. 

The second thread is in charge of actually sending the chunks of data requested by the 
peer. It usually sleeps on a condition variable, and is woken up by the first thread when 
needed. Because these threads are scheduled cooperatively, the list of pending chunks is 
manipulated by the two threads without need for a lock. 

Every read on a network interface is guarded by a timeout, and a peer that has not been 
involved in any activity for a period of time is disconnected. Earlier versions of Hekate 
which did not include this protection would end up clogged by idle peers, which prevented 
new peers from connecting. 

In order to simplify the protocol-related code, timeouts are implemented in the buffered 
read function, which spawns a new timeout thread on each invocation. This temporary third 
thread sleeps for the duration of the timeout, and aborts the I/O if it is still pending. Because 
most timeouts do not expire, this solution relies on the efficiency of spawning and context- 
switching short-lived CPC threads (see Section[4]i. 

Native and cps functions CPC threads might execute two kinds of code: native functions and 
cps functions (annotated with the cps keyword). Intuitively, cps functions are interruptible 
and native functions are not: it is possible to interrupt the flow of a block of cps code in order 
to pass control to another piece of code, to wait for an event to happen or to switch to another 
scheduler (Section [33l l. Note that the cps keyword does not mean that the function is written 
in continuation-passing style, but rather that it is to be CPS-converted by the CPC translator. 
Native code, on the other hand, is "atomic": if a sequence of native code is executed in 
cooperative mode, it must be completed before anything else is allowed to run in the same 
scheduler. From a more technical point of view, cps functions are compiled by performing 
a transformation to Continuation Passing Style (CPS), while native functions execute on the 
native stack. 

There is a global constraint on the call graph of a CPC program: a cps function may only 
be called by a cps function; equivalently, a native function can only call native functions — 
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cps void 

listening(hashtable * table) { 
/* ... */ 
while (1) { 

cpc_io_wait(socket_fd, CPC_I0_IN) ; 

client_fd = accept(socket_fd, ...); 

cpc_spawn client (table , client_f d) ; 

} 

} 

Fig. 1 Accepting connections and spawning threads 

but a cps function may call a native function. This means that at any point in time, the 
dynamic chain consists of a "cps stack" of cooperating functions followed by a "native 
stack" of regular C functions. Since context switches are forbidden in native functions, only 
the cps stack needs to be saved and restored when a thread cooperates. 

CPC primitives CPC provides a set of primitive cps functions, which allow the programmer 
to schedule threads and wait for some events. These primitive functions could not have been 
defined in user code: they must have access to the internals of the scheduler to operate. Since 
they are cps functions, they can only be called by another cps function. 

Figure[T]shows an example of a cps function: listening calls the primitive cpc_io_wait 
to wait for the file descriptor socket_f d to be ready, before accepting incoming connections 
with the native function accept and spawning a new thread for each of them. Note the use 
of the cpc_spawn keyword to create a new thread executing the cps function client. 

The CPC language provides five cps primitives to suspend and synchronise threads on 
some events. The simplest one is cpc_yield, which yields control to the next thread to 
be executed. The primitives cpc_io_wait and cpc_sleep suspend the current thread until 
a given file descriptor has data available or some time has elapsed, respectively. A thread 
can wait on some condition variable [23 1 with cpc_wait ; threads suspended on a condition 
variable are woken with the (non-cps) functions cpc_signal and cpc_signal_all. To allow 
early interruptions, cpc_io_wait and cpc_sleep also accept an optional condition variable 
which, if signaled, will wake up the waiting thread. 

The fifth cps primitive, cpc_link, is used to control how threads are scheduled. We give 
more details about it in Section [3~3l 

We found that these five primitives are enough to build more complex synchronisation 
constructs and cps functions, such as barriers or retriggerable timeouts. Some of these gen- 
erally useful functions, written in CPC and built above the CPC primitives, are distributed 
with CPC and form the CPC standard library. 

3.2 Comparison with event-driven programming 

Code readability Hekate's code is much more readable than its event-driven equivalents. 
Consider for instance the BitTorrent handshake, a message exchange occurring just after a 
connection is established. In Transmission a popular and efficient BitTorrent client written 
in (mostly) event-driven style, the handshake is a complex piece of code, spanning over a 
thousand lines in a dedicated file. By contrast, Hekate's handshake is a single function of 
less than fifty lines including error handling. 

4 http://www.transmissionbt.com/ 
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While some of Transmission's complexity is explained by its support for encrypted con- 
nexions, Transmission's code is intrinsically much more messy due to the use of callbacks 
and a state machine to keep track of the progress of the handshake. This results in an obfus- 
cated flow of control, scattered through a dozen of functions (excluding encryption-related 
functions), typical of event-driven code. 



Expressivity Surprisingly enough, CPC threads turn out to naturally express some idioms 
that are more commonly associated with event-driven style. 

A case in point: buffer allocation for reading data from the network. When a native 
thread performs a blocking read, it needs to allocate the buffer before the read system call; 
when many threads are blocked waiting for a read, these buffers add up to a significant 
amount of storage. In an event-driven program, it is possible to delay allocating the buffer 
until after an event indicating that data is available has been received. 

The same technique is not only possible, but actually natural in CPC: buffers in Hekate 
are only allocated after cpc_io_wait has successfully returned. This provides the reduced 
storage requirements of an event-driven program while retaining the linear flow of control 
of threads. 



3.3 Detached threads 

While cooperative, deterministically scheduled threads are less error-prone and easier to 
reason about than preemptive threads, there are circumstances in which native operating 
system threads are necessary. In traditional systems, this implies either converting the whole 
program to use native threads, or manually managing both kinds of threads. 

A CPC thread can switch from cooperative to preemptive mode at any time by using the 
cpc_link primitive (inspired by FairThreads' f t_thread_link [6|). A cooperative thread is 
said to be attached to the default scheduler, while a preemptive one is detached. 

The cpc_link primitive takes a single argument, a scheduler, either the default event 
loop (for cooperative scheduling) or a thread pool (for preemptive scheduling). It returns the 
previous scheduler, which makes it possible to eventually restore the thread to its original 
state. Syntactic sugar is provided to execute a block of code in attached or detached mode 
(cpc_attached, cpc_detached). 

Hekate is written in mostly non-blocking cooperative style; hence, Hekate's threads re- 
main attached most of the time. There are a few situations, however, where the ability to 
detach a thread is needed. 



Blocking OS interfaces Some operating system interfaces, like the getaddrinf o DNS re- 
solver interface, may block for a long time (up to several seconds). Although there exist 
several libraries which implement equivalent functionality in a non-blocking manner, in 
CPC we simply enclose the call to the blocking interface in a cpc_detached block (see 
Figure |2^). 

Figure I2J5 shows how cpc_det ached is expanded by the translator into two calls to 
cpc_link. Note that CPC takes care to attach the thread before returning to the caller func- 
tion, even though the return statement is inside the cpc_detached block. 
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cpc_detached { 

rc = getaddrinf o (name , ...) 
return rc ; 

} 

(a) 



cpc_scheduler *s = 

cpc_link(cpc_def ault_threadpool) ; 
rc = getaddrinf o (name, ...) 
cpc_link(s) ; 
return rc; 

(b) 



Fig. 2 Expansion of cpc_detached in terms of cpc_link 

pref etch(source , length); 
cpc_yield() ; 

if (! incore (source , length)) { 
cpc_yield() ; 

if (! incore (source , length)) { 
cpc_detached { 

rc = cpc_write (f d, source, length); 

} 

goto done; 

} 

} 

rc = cpc_write(f d, source, length); 



done : 



/* (1) */ 

/* (2) */ 

/* (3) */ 

/* (4) */ 

/* (5) */ 

/* (6) */ 



/* (7) */ 



The functions prefetch and incore are thin wrappers around the posix_madvise and mincore system 
calls. 



Fig. 3 An example of hybrid programming (non-blocking read) 



Blocking library interfaces Hekate uses the curl librarytjto contact BitTorrent trackers over 
HTTP. Curl offers both a simple, blocking interface and a complex, asynchronous one. We 
decided to use the one interface that we actually understand, and therefore call the blocking 
interface from a detached thread. 

Parallelism Detached threads make it possible to run on multiple processors or processor 
cores. Hekate does not use this feature, but a CPU-bound program would detach computa- 
tionally intensive tasks and let the kernel schedule them on several processing units. 



3.4 Hybrid programming 

Most realistic event-driven programs are actually hybrid programs ["36 . 53"]: they consist of 
a large event loop, and a number of threads (this is the case, by the way, of the Transmission 
BitTorrent client mentioned above). Such blending of native threads with event-driven code 
is made very easy by CPC, where switching from one style to the other is a simple matter of 
using the cpc_link primitive. 

This ability is used in Hekate for dealing with disk reads. Reading from disk might block 
if the data is not in cache; however, if the data is already in cache, it would be wasteful to pay 
the cost of a detached thread. This is a significant concern for a BitTorrent seeder because 
the protocol allows requesting chunks in random order, making kernel readahead heuristics 
inefficient. 

The actual code is shown in Figure [3) it sends a chunk of data from a memory-mapped 
disk file over a network socket. In this code, we first trigger an asynchronous read of the 



5 http://curl.haxx.se/libcurl/ 
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on-disk data (1), and immediately yield to threads servicing other clients (2) in order to give 
the kernel a chance to perform the read. When we are scheduled again, we check whether 
the read has completed (3); if it has, we perform a non-blocking write (7); if it hasn't, we 
yield one more time (4) and, if that fails again (5), delegate the work to a native thread which 
can block (6). 

Note that this code contains a race condition: the prefetched block of data could have 
been swapped out before the call to cpc_write, which would stall Hekate until the write 
completes. Moreover, this race condition is even more likely to appear as load increases 
and on devices with constrained resources. To avoid this race condition and ensure non- 
blocking disk reads, one could use asynchronous I/O. However, while the Linux kernel does 
provide a small set of asynchronous I/O system calls, we found them scarcely documented 
and difficult to use: they work only on some file systems, impose alignment restrictions on 
the length and address of buffers, and disable the caching performed by the kernel. We have 
therefore not experimented with them. 

Note further that the call to cpc_write in the cpc_detached block (6) could be replaced 
by a call to write: we are in a native thread here, so the non-blocking wrapper is not needed. 
However, the CPC primitives such as cpc_io_wait are designed to act sensibly in both 
attached and detached mode; this translates to more complex functions built upon them, and 
cpc_write simply behaves as write when invoked in detached mode. For simplicity, we 
choose to use the CPC wrappers throughout our code. 

4 Experimental results 

The CPC language provides no more than half a dozen very low-level primitives. The stan- 
dard library and CPC programs are implemented in terms of this small set of operations, and 
are therefore directly dependent on their performance. In Section RTTl we show the results of 
benchmarking individual CPC primitives against other thread libraries. In these benchmarks, 
CPC turns out to be comparable to the fastest thread libraries available to us. 

In the absence of reliable information on the relative dynamic frequency of cps function 
calls and thread synchronisation primitives in CPC programs, it is not clear what the per- 
formance of individual primitives implies about the performance of a complete program. In 
Section 14.21 we present the results of benchmarking a set of naive web servers written in 
CPC and in a number of thread libraries. 

Finally, in Section l4~3l we present a number of insights that we gained by working with 
Hekate, our BitTorrent seeder written in CPC. 

We compare CPC with the following thread libraries: nptl, the native thread library 
in GNU libc 2.13 (or jiiClibc 0.9.32 for our tests on embedded hardware) G2); GNU Pth 
version 2.0.7 1161 ; State Threads (ST) version 1.9 1421 . Nptl is a kernel thread library, while 
GNU Pth and ST are cooperative user-space thread libraries. 



4. 1 Speed of CPC primitives 
4.1.1 Space utilisation 

On a 64-bit machine with 4 GB of physical memory and no swap space, CPC can handle up 
to 50. 1 million simultaneous threads, which implies an average memory usage of roughly 82 
bytes per continuation. This figure compares very favourably to both kernel and user-space 
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thread libraries (see Figure[4]l, which our tests have shown to be limited on the same system 
to anywhere from 32000 to 934600 threads in their default configuration, and to 961 400 
threads at most after some tuning. 

4.1.2 Speed 

Figure [5] presents timings on three different processors: the Core 2 Duo, a superscalar out- 
of-order 64-bit processor, the Pentium-M, a slightly less advanced out-of-order 32-bit pro- 
cessor, and the MIPS 4Kc, a simple in-order 32-bit RISC chip with a short pipeline. 



Library 


Number of threads 


nptl 


32 330 


Pth 


700000 (est.) " 


ST 


934600 


ST (4kB stacks) 


961 400 


CPC 


50190000 



Fig. 4 Number of threads possible with 4 GB of memory 



Function calls Modern processors include specialised hardware for optimising function 
calls; obviously, this hardware is not able to optimise CPS-converted function calls, which 
are therefore dramatically slower than direct-style (non-cps) calls. In our very simple bench- 
mark (two embedded loops containing a function call), the Core 2 Duo was able to com- 
pletely hide the cost of a trivial function cal£| while the function call costs three processor 
cycles on the Pentium-M and just one on the MIPS processor; we do not claim to understand 
the hardware-level magic that makes such results possible. 

With CPS-converted calls, on the other hand, our benchmarking loop takes on the order 
of 100 processor cycles per iteration (400 on MIPS). This apparently poor result is mitigated 
by the fact that after goto elimination and lambda-lifting, the loop consists of four CPS 
function calls; hence, on the Pentium-M, a CPS function call is just 6 times slower than a 
direct-style call, a surprisingly good result considering that our continuations are allocated 
on the heap |34|. On the MIPS chip the slowdown is closer to a factor of 100, more in line 
with what we expected. 

Context switches and thread creation The "real" primitives are pleasantly fast in CPC. Con- 
text switches have similar cost to function calls, which is not surprising since they compile 
to essentially the same code. While the benchmarks make them appear to be much faster 
than in any other implementation, including ST, this result is skewed by a flaw in the ST li- 
brary, which goes through the event loop on every st_yield operation. Context switches on 

" Pth never completed, the given value is an educated guess. 

* On x86-64, performing a direct-style function call adds no measurable delay. 

c The behaviour of the system call schecLyield has changed in Linux 2.6.23, where it has the 
side-effect of reducing the priority of the calling thread. Until Linux 2.6.38, setting the kernel variable 
sched_compat_yield restored the previous behaviour, which we have done in our x86-32 and MIPS-32 
benchmarks. This knob having been removed in Linux 2.6.39, the nptl results on x86-64 are highly suspect. 

6 Disassembly of the binary shows that the function call is present. 
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Library loop call* cps-call switch cond spawn 



nptl (1 core) 


2 2 439 


1318 


4 000 000 c 


nptl (2 cores) 


2 2 135 


1791 


11037 


Pth 


2 2 2130 


2 602 


6940 


ST 


2 2 170 


25 


411 


CPC 


2 2 20 16 


36 


74 


x86-64: Intel Core 2 Duo at 3.17 GHz, Linux 2.6.39 


Library 


loop call cps-call switch 


cond 


spawn 


nptl 


2 4 691 


2426 


24575 


ST 


2 4 1 003 


96 


2293 


CPC 


2 4 54 46 


91 


205 


x86-32: Intel Pentium M at 1.7 GHz, Linux 2.6.38 


Library 


loop call cps-call switch 


cond 


spawn 


nptl 


58 63 6 310 63 305 


933 689 


CPC 


58 63 2018 1482 


3 268 


8519 


MIPS-32: MIPS 4KEc at 184 MHz, Linux 2.6.37 



All times are in nanoseconds per loop iteration, averaged over millions of runs (smaller is better). 

The columns are as follows: 

loop: time per loop iteration (reference); 

call: time per loop iteration calling a direct-style function; 

cps-call: time per loop iteration calling a CPS-converted function; 

switch: context switch; 

cond: context switch on a condition variable; 

spawn: thread creation. 

Fig. 5 Speed of thread primitives on various architectures 

a condition variable are roughly two times slower than pure context switches, which yields 
timings comparable to those of ST. 

Thread creation is similarly fast, since it consists in simply allocating a new continuation 
and registering it with the event loop. This yields results that are ten times faster than ST, 
which must allocate a new stack, and a hundred times faster than the kernel implementation. 



4.2 Macro-benchmarks 

As we have seen above, the speed of CPC primitives ranges from similar to dramatically 
faster than that of their equivalents in traditional implementations. However, a CPC program 
incurs the overhead of the CPS translation, which introduces CPS function calls which are 
much slower than their direct-style equivalents. As it is difficult to predict the number of 
CPS-converted calls in a CPC program, it is not clear whether this performance increase 
will be reflected in realistic programs. 

We have written a set of naive web servers (less than 200 lines each) that share the ex- 
act same structure: a single thread or process waits for incoming connections, and spawns 
a new thread or process as soon as a client is accepted (there is no support for HTTP/1.1 
persistent connections). The servers were benchmarked by repeatedly downloading a tiny 
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file over a dedicated Ethernet with different numbers of simultaneous clients and measur- 
ing the average response time. Because of the simple structure of the servers, this simple, 
repeatable benchmark measures the underlying implementation of concurrency rather than 
the performance tweaks of a production web server. 




Figure [6] presents the results of our experiment. It shows the average serving time per 
request against the number of concurrent requests; a smaller slope indicates a faster server. 
It shows that on this particular benchmark all of our web servers show the expected linear 
behaviour, and that CPC's performance is comparable to that of ST, and much better than 
that of the other implementations. A more in-depth analysis of this benchmark is available 
in a technical report [26|. 



4.3 Hekate 

Benchmarking a BitTorrent seeder is a difficult task because it relies either on a real-world 
load, which is hard to control and only provides seeder-side information, or on an artificial 
testbed, which might fail to accurately reproduce real-world behaviour. We have however 
been able to collect enough empirical evidence to show that Hekate (Section^ is an efficient 
implementation of BitTorrent, lightweight enough to be run on embedded hardware. 

Real-world workload To benchmark the ability of Hekate to sustain a real- world load, we 
need popular torrents with many requesting peers over a long period of time. Updates for 
Blizzard's game World of Warcraft (WoW), distributed over BitTorrent, meet these condi- 
tions: each of the millions of WoW players around the world runs a hidden BitTorrent client, 
and at any time many of them are looking for the latest update. 

We have run an instance of Hekate seeding WoW updates without interruption for 
weeks. We saw up to 1,000 connected peers (800 on average) and a throughput of up to 
lOMB/s (around 5 MB/s on average). Hekate never used more than 10% of the 3.16 GHz 
dual core CPU of our benchmarking machine, and was bottlenecked either by the available 
bandwidth during peaks of requests (10 MB/s), or by the mere fact that we did not gather 
enough peers to saturate the link. 



13 



Stress-test on embedded hardware We have ported Hekate to OpenWr^\, a Linux distribu- 
tion for embedded devices. Hekate runs flawlessly on a cheap home router with a 266 MHz 
MIPS 4Kc CPU, 32 MB of RAM and a 100 Mbps network card. The torrent files were kept 
on a USB flash drive. 

Our stress-test consists in 1,000 simultaneous clients (implemented as a single CPC pro- 
gram running on a fast computer directly connected to the device over 100 Mbps Ethernet) 
requesting random chunks of a 1.2 GB torrent. In these conditions, Hekate was able to serve 
data at a rate of 2.9 MB/s. The CPU load was pegged at 100%, with most of the CPU time 
spent servicing software interrupt requests, notably from the USB subsystem (60 % sirq, the 
usb-storage kernel module using up to 25 % of CPU). These data indicate that even on a 
slow MIPS processor, Hekate's performance is fairly good, and that the performance could 
be much improved by using a device on which mass storage traffic doesn't go over the USB 
bus. 



5 The CPC compilation technique 

The current implementation of CPC is structured into three parts: the CPC to C transla- 
tor, implemented in Objective Caml 1321 on top of the CIL framework I35L the runtime, 
implemented in C, and the standard library, implemented in CPC. The three parts are as in- 
dependent as possible, and interact only through a small set of well-defined interfaces. This 
makes it easier to experiment with different approaches. 

In this section, we present how the CPC translator turns a CPC program in threaded style 
into an equivalent C program written in continuation-passing style. Therefore, we only need 
to focus on the transformations applied to cps functions, ignoring completely the notions 
of CPC thread or CPC primitive which are handled in the runtime part of CPC. We detail 
each step of the CPC compilation technique and the difficulties specifically related to the C 
language. 

5.1 Translation passes 

The CPC translator is structured in a series of source-to-source transformations which turn 
a threaded-style CPC program into an equivalent event-driven C program. This sequence of 
transformations consists of the following passes: 

Boxing a small number of variables needs to be encapsulated in environments to ensure the 

correctness of the later passes; 
Splitting the flow of control of each cps function is split into a set of mutually recursive, 

tail-called, inner functions; 
Lambda-lifting free local variables are copied from one inner function to the next, yielding 

closed inner functions; 

CPS conversion at this point, the program is in CPS-convertible form, a form simple enough 
to perform a one-pass partial conversion into continuation-passing style; the resulting 
continuations are used at runtime by the CPC scheduler. 

The converted program is then compiled by a standard C compiler and linked to the CPC 
scheduler to produce the final executable. 
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All of these passes are well-known compilation techniques for functional programming 
languages, but lambda-lifting and CPS conversion are not correct in general for an impera- 
tive call-by-value language such as C. The problem is that these transformations copy free 
variables: if the original variable is modified, the copy becomes out-of-date and the program 
yields a different result. 

Copying is not the only way to handle free variables. When applying lambda-lifting to a 
call-by-value language with mutable variables, the common solution is to box variables that 
are actually mutated, storing them in environments. However, this is less efficient: adding a 
layer of indirection hinders cache efficiency and breaks a number of compiler optimisations. 
Therefore, CPC strives to avoid boxing as much as possible. 

One key point of the efficiency of CPC is that we need not box every mutated variable 
for lambda-lifting and CPS conversion to be correct. As we show in Section[6] even though 
C is an imperative language, lambda-lifting is correct without boxing for most variables, 
provided that the lifted functions are called in tail position. As it turns out, functions issued 
from the splitting pass are always called in tail position, and it is therefore correct to perform 
lambda-lifting in CPC while keeping most variables unboxed. 

Only a small number of variables, whose addresses have been retained with the "address 
of" operator (&), need to be boxed: we call them extruded variables. Our experiments with 
Hekate show that 50 % of local variables in cps functions need to be lifted. Of that number, 
only 10 % are extruded. In other words, in the largest program written in CPC so far, our 
translator manages to box only 5 % of the local variables in cps functions. 

Sections 15.21 to IB31 present each pass, explain how they interact and why they are cor- 
rect. Although CPS conversion is the last pass performed by the CPC translator, we present 
it first (Section [5.2b because it helps understanding the purpose of the other passes, which 
aim at translating the program into CPS-convertible form; the other passes are then pre- 
sented chronologically. Also note that the correctness of the lambda-lifting pass depends on 
a theorem that will be shown in SectionjS] 



5.2 CPS conversion 

Conversion into Continuation Passing Style [46,33, or CPS conversion for short, is a pro- 
gram transformation technique that makes the flow of control of a program explicit and 
provides first-class abstractions for it. 

Intuitively, the continuation of a fragment of code is an abstraction of the action to 
perform after its execution. For example, consider the following computation: 

f(g(5) + 4); 

The continuation of g(5) is f (• + 4) because the return value of g will be added to 4 and 
then passed to f . 

CPS conversion consists in replacing every function / in a program with a function f* 
taking an extra argument, its continuation. Where / would return with value v, f* invokes 
its continuation with the argument v. A CPS-converted function therefore never returns, but 
makes a call to its continuation. Since all of these calls are in tail position, a converted 
program doesn't use the native call stack: the information that would normally be in the call 
stack (the dynamic chain) is encoded within the continuation. 

This translation has three well-known interesting properties, on which we rely in the 
implementation of CPC: 
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CPS conversion need not be global The CPS conversion is not an "all or nothing" deal, in 
which the complete program must be converted: there is nothing preventing a converted 
function from calling a function that has not been converted. On the other hand, a func- 
tion that has not been converted cannot call a function that has, as it does not have a 
handle to its own continuation. 

It is therefore possible to perform CPS conversion on just a subset of the functions 
constituting a program (in the case of CPC, such functions are annotated with the cps 
keyword). This allows cps code to call code in direct style, for example system calls or 
standard library functions. Additionally, at least in the case of CPC, a cps function call 
is much slower than a direct function call; being able to only convert the functions that 
need the full flexibility of continuations avoids this overhead as much as possible. 

Continuations are abstract data structures The classical definition of CPS conversion im- 
plements continuations as functions. However, continuations are abstract data structures 
and functions are only one particular concrete representation of them. 
The CPS conversion process performs only two operations on continuations: calling a 
continuation, which we call invoke, and prepending a function application to the body 
of a continuation, which we call push. This property is not really surprising: as contin- 
uations are merely a representation for the dynamic chain, it is natural that there should 
be a correspondence between the operations available on a continuation and a stack. 
Since C doesn't have full first-class functions (closures), CPC uses this property to im- 
plement continuations as arrays of function pointers and parameters. 

Continuation transformers are linear CPS conversion introduces linear ("one-shot") con- 
tinuations [5,7]: when a CPS-converted function receives a continuation, it will use it 
exactly once, and never duplicate or discard it. 

This property is essential for memory management in CPC: as CPC uses the C allocator 
(malloc and free) rather than a garbage collector for managing continuations, it allows 
reliably reclaiming continuations without the need for costly devices such as reference 
counting. 

CPS conversion is not defined for every C function; instead, we restrict ourselves to a 
subset of functions, which we call the CPS-convertible subset. As we shall see in Section [5~4l 
every C function can be converted to an equivalent function in CPS-convertible form. 

The CPS-convertible form restricts the calls to cps functions to make it straightforward 
to capture their continuation. In CPS-convertible form, a call to a cps function f is either in 
tail position, or followed by a tail call to another cps function whose parameters are non- 
shared variables, that cannot be modified by f . This restriction about shared variables is 
ensured by the boxing pass detailed in Section l531 

Definition 1 (Extruded and shared variables) Extruded variables are local variables (or 
function parameters) the address of which has been retained using the "address of" operator 
(*)■ 

Shared variables are either extruded or global variables. □ 

Thus, the set of shared variables includes every variable that might be modified by several 
functions called in the same dynamic chain. 

Definition 2 (CPS-convertible form) A function h is in CPS-convertible form if every call 
to a cps function that it contains matches one of the following patterns, where both f and g 
are cps functions, aj , . . . , a n are any C expressions and x , yj , . . . , y n are non-shared 
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□ 



We use return to explicitly mark calls in tail position. The forms ©to © are only neces- 
sary to handle the cases where f and g return void; in the rest of this paper, we ignore these 
cases and focus on the essential cases (1) and (2). 

Early evaluation of non-shared variables To understand why the definition of CPS-convert- 
ible form requires non-shared variables for the parameters of g, consider what happens when 
converting a CPS-convertible function. We use a partial CPS conversion, as explained above, 
focused on tail positions. In a functional language, we would define the CPS conversion as 
follows, where f * is the CPS-converted form of f and k the current continuation: 

return a; — > return k(a) ; 

return f(ai, a n ) ; — > return f*(ai, a n , k) ; 

x = f(a 1; a n ); _^ k' = Ax. g*(x, y lt y n> k) ; 

return g(x, yi , y n ) ; return f*(aj, a n , k'); 

Note the use of the lambda-abstraction Ax . g* (x , yi , . . . , y n , k) to represent the con- 
tinuation of the call to f * in the last case. In a call-by-value language, this continuation can 
be described as: "get the return value of f *, bind it to x, evaluate the variables yi to y n and 
call g* with these parameters." 

Representing this continuation in C raises the problem of evaluating the values of yi to 
y n after the call to f * has completed: these variables are local to the enclosing CPS-converted 
function and, as such, allocated in a stack frame which will be discarded when it returns. To 
keep them available until the evaluation of the continuation, we would need to store them in 
an environment and garbage-collect them at some point. We want to avoid this approach as 
much as possible for performance reasons since it implies an extra layer of indirection and 
extra bookkeeping. 

We use instead a technique that we call early evaluation of variables. It is based on the 
following property: if we can ensure that the variables yi cannot be modified by the function 
f , then it is correct to commute their evaluation with the call to f . Because the CPC translator 
produces code where these variables are not shared, thanks to the boxing pass (Section l573l . 
it is indeed guaranteed that they cannot be modified by a call to another function in the same 
dynamic chain. In CPC, the CPS conversion therefore evaluates these variables when the 
continuation is built, before calling f , and stores their values directly in the continuation. 

We finally define the CPS conversion pass, using early evaluation and the fact mentioned 
above that continuations are abstract data structures manipulated by two operators: push, 
which adds a function to the continuation, and invoke which executes a continuation. 
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Definition 3 (CPS conversion) The CPS conversion translates the tail positions of every 
CPS-convertible term as follows, where f* is the CPS-converted form of f and k is the 
current continuation: 

return a; — > 
return f (aj , . . . , a n ) ; — > 

x = f (a! , . . . , a n ) ; 

return g(x, y t , . . . , y n ) ; 



A proof of correctness of this conversion, including a proof that early evaluation is correct 
in the absence of shared variables, is available in the technical report 1 28 ] . 

Implementation From an implementation standpoint, the continuation k is a stack of func- 
tion pointers and parameters, and push adds elements to this stack. The function invoke 
calls the first function of the stack, with the rest of the stack as its parameters. The form 
invoke (k , a) also takes care of passing the value a to the continuation k; it simply pushes 
a at the top of the continuation before calling its first function. 

The current implementation of push grows continuations by a multiplicative factor when 
the array is too small to contain the pushed parameters, and never shrinks them. While 
this might in principle waste memory in the case of many long-lived continuations with an 
occasional deep call stack, we believe that this case is rare enough not to justify complicating 
the implementation with this optimisation. 

The function push does not align parameters on word boundaries, which leads to smaller 
continuations and easier store and load operations. Although word-aligned reads and writes 
are more efficient in general, our tests showed little or no impact in the CPC programs we 
experimented with, on x86 and x86-64 architectures: the worst case has been a 10 % slow- 
down in a micro-benchmark with deliberately misaligned parameters. We have reconsidered 
this trade-off when porting CPC to MIPS, an architecture with no hardware support for un- 
aligned accesses, and added a compilation option to align continuation frames. However 
we keep this option disabled by default, until we have more experimental data to assess its 
efficiency. 

There is one difference between Definition[3]and the implementation. In a language with 
proper tail calls, each function would simply invoke the continuation directly; in C, which 
is not properly tail-recursive, doing that leads to unbounded growth of the native call stack. 
Therefore, the tail call return invoke (k) cannot be used; we work around this issue by 
using a "trampolining" technique 1191 . Instead of calling invoke, each cps function returns 
its continuation: return k. The main event loop iteratively receives these continuations and 
invokes them until a CPC primitive returns NULL, which yields to the next CPC thread. 

In the following sections, we show how the boxing, splitting and lambda-lifting passes 
translate any CPC program into CPS-convertible form. 



5.3 Boxing 

Boxing is a common, straightforward technique to encapsulate mutable variables. It is neces- 
sary to ensure the correctness of the CPS conversion and lambda-lifting passes (Sections l5.2l 



return invoke (k, a); 

push (k, f*, aj , a n ) ; 

return invoke (k) ; 

push (k, g*, yi y n ); 

push (k, f*, aj , a n ) ; 

return invoke (k) ; 

□ 
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and 15.5b . However, boxing induces an expensive indirection to access boxed variables. In 
order to keep this cost at an acceptably low level, we box only a subset of all variables — 
namely extruded variables, whose address is retained with the "address of" operator & (see 
Definition Q] p.[T5ll. 

Example Consider the following function: 

cps void f(int x) { 
int y = 0; 

int *pl = &x, *p2 = &y; 

/* ... */ 

return; 

} 

The local variables x and y are extruded, because their address is stored in the pointers pi 
and p2. The boxing pass allocates them on the heap at the beginning of the function f and 
frees them before f returns. For instance, in the next program every occurrence of x has been 
replaced by a pointer indirection *px, and &x by the pointer px. 

cps void f(int x) { 

int *px = malloc(sizeof (int) ) ; 

int *py = malloc(sizeof (int) ) ; 

*px = x; /* Initialise px */ 

*py = 0; 

int *pl = px, *p2 = py; 

/* ... with x and y replaced accordingly */ 

free(px); free(py); 

return; 

} 

The extruded variables x and y are not used anymore (except to initialise px). Instead, px 
and py are used; note that these variables, contrary to x and y, are not extruded: they hold 
the address of other variables, but their own address is not retained. After the boxing pass, 
there are no more extruded variables used in cps functions. 

Cost analysis The efficiency of CPC relies in great part on avoiding boxing as much as pos- 
sible. Performance-wise, we expect boxing only extruded variables to be far less expensive 
than boxing every lifted variable. Indeed, in a typical C program, few local variables have 
their address retained compared to the total number of variables. 

Experimental data confirm this intuition: in Hekate, the CPC translator boxes 13 vari- 
ables out of 125 lifted parameters. This result is obtained when compiling Hekate with the 
current CPC implementation. They take into account the fact that the CPC translator tries 
to be smart about which variables should be boxed or lifted: for instance, if the address of 
a variable is retained with the "address of" operator & but never used, this variable is not 
considered as extruded. Using a naive implementation, however, does not change the pro- 
portion of boxed variables: with optimisations disabled, 29 variables are boxed out of 323 
lifted parameters. In both cases, CPC boxes about 10 % of the lifted variables, which is an 
acceptable overhead. 
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Interaction with other passes The boxing pass yields a program without the "address of" 
operator (&); extruded variables are allocated on the heap, and only pointers to them are 
copied by the lambda-lifting and CPS-conversion passes rather than extruded variables 
themselves. One may wonder, however, whether it is correct to perform boxing before ev- 
ery other transformation. It turns out that boxing does not interfere with the other passes, 
because they do not introduce any additional "address of" operators. The program therefore 
remains free of extruded variables. Moreover, it is preferable to box early, before introducing 
inner functions, since it makes it easier to identify the entry and exit points of the original 
function, where variables are allocated and freed. 



Extruded variables and tail calls Although we keep the cost of boxing low, with about 
10 % of boxed variables, boxing has another, hidden cost: it breaks tail recursive cps calls. 
Since the boxed variables might, in principle, be used during the recursive calls, one cannot 
free them beforehand. Therefore, functions featuring extruded variables do not benefit from 
the automatic elimination of tail recursive calls induced by the CPS conversion. While this 
prevents CPC from optimising tail recursive calls "for free", it is not a real limitation: the C 
language does not ensure the elimination of tail recursive calls anyway, as the stack frame 
should similarly be preserved in case of extruded variables, and C programmers are used not 
to rely on it. 



5.4 Splitting 

To transform a CPC program into CPS-convertible form, the CPC translator needs to ensure 
that every call to a cps function is either in tail position or followed by a tail call to another 
cps function. In the original CPC code, calls to cps functions might, by chance, respect 
this property but, more often than not, they are followed by some direct-style (non-cps) 
code. The role of the CPC translator is therefore to replace this direct-style chunk of code 
following a cps call by a tail call to a cps function which encapsulates this chunk. We call 
this pass splitting because it splits each original cps function into many, mutually recursive, 
cps functions in CPS-convertible form. 

To reach CPS-convertible form, the splitting pass must introduce tail calls after every 
existing cps call. This is done in two steps: we first introduce a goto after every existing cps 
call (Section r5.4.1l l. then we translate these goto into tail calls (Section |5A2]|. 

The first step consists in introducing a goto after every cps call. Of course, to keep the 
semantics of the program unchanged, this inserted goto must jump to the statement follow- 
ing the tail call in the control-flow graph: it makes the control flow explicit, and prepares the 
second step which translates these goto into tail calls. In most cases, the control flow falls 
through linearly and inserting goto statements is trivial; as we shall see in Section [5 .4. II 
more care is required when the control flow crosses loops and labelled blocks. This step 
produces code which resembles CPS-convertible form, except that it uses goto instead of 
tail calls. 

The second step is based on the observation that tail calls are equivalent to jumps. We 
convert each labelled block into an inner function, and each goto statement into a tail call 
to the associated function, yielding a program in CPS-convertible form. 

We detail these two steps in the rest of this section. 
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5.4.1 Explicit flow of control 

When a cps call is not in tail position, or followed by a tail cps call, the CPC translator adds 
a goto statement after it, to jump explicitly to the next statement in the control-flow graph. 
These inserted goto are to be converted into tail cps calls in the next step. 

In most cases, the control flow falls through linearly, and making it explicit is trivial. For 
instance, 

cpc_yield(); rc = 0; 

becomes 

cpc_yield() ; goto 1; 
1: rc = 0; 

However, loops and conditional jumps require more care: 

while (! timeout) { 

int rc = cpc_read(); 
if(rc <= 0) break; 
cpc_write() ; 

} 

reset_timeout() ; 

is converted to: 

while_label : 
if (! timeout) { 

int rc = cpc_read() ; goto 1; 
1: 

if(rc <= 0) goto break_label; 
cpc_write(); goto while_label; 

> 

break_label : 

reset_timeout() ; 

More generally, when the flow of control after a cps call goes outside of a loop (for, while 
or do ... while) or a switch statement, the CPC translator simplifies these constructs into 
if and goto statements, adding the necessary labels on the fly. 

Although adding trivial goto, and making loops explicit, brings the code in a shape close 
to CPS -convertible form, we need to add some more goto statements for the second step to 
correctly encapsulate chunks of code into cps functions. Consider for instance the following 
piece of code: 

if(rc < 0) { 

cpc_yield() ; rc = 0; 

} 

printfC'rc = 7„d\n" , rc) ; 
return rc ; 

With the rules described above, it would be translated to: 

if(rc < 0) { 

cpc_yield() ; goto 1; 
1: { rc = 0; > 
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} 

printf ("rc = °/,d\n" , rc) ; 
return rc ; 

This is not enough because the block labelled by 1 will be converted to a cps function in the 
second step. But if we encapsulate only the {rc = ; } part, we will miss the call to printf 
when rc < 0. Therefore, to ensure a correct encapsulation in the second step, we also need 
to make the flow of control explicit when exiting a labelled block. The example becomes: 

if(rc < 0) { 

cpc_yield() ; goto 1; 

1: { rc = 0; goto done; }■ 

> 

done : 

printf ("rc = °/,d\n" , rc) ; 
return rc ; 

After this step, every cps call is either in tail position or followed by a goto statement, 
and every labelled block exits with either a return or a goto statement. 

5.4.2 Eliminating gotos 

It is a well-known fact in the compiler folklore that a tail call is equivalent to a goto. It is 
perhaps less known that a goto is equivalent to a tail call [ 45 , 54 1 : the block of any destination 
label 1 is encapsulated in an inner function 1(), and each goto 1; is replaced by a tail call 
to 1. 

Coming back to the first example of SectionfSAT] the label 1 yields a function 1(): 

f () ; return 1() ; 

cps void 1() { rc = 0; } 

We see that the first line is now in CPS-convertible form. Note again that we use return to 
mark tail calls explicitly in the C code. 

Applying the same conversion to loops yields mutually recursive functions. For in- 
stance, the while loop in the second example is converted into while_label(), 10 and 
break_label() : 

while_labelO ; 
cps void while_label () { 
if (! timeout) ■[ 

int rc = cpc_read() ; return 10; 
cps void 1() { 

if (rc <= 0) return break_label() ; 
cpc_write(); return while_label() ; 

} 

} 

} 

cps void break_label() { 
reset_timeout () ; 

} 
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When a label, like while_label above, is reachable not only through a goto, but also di- 
rectly through the linear flow of the program, it is necessary to call its associated function at 
the point of definition; this is what we do on the first line of this example. 

Note that splitting may introduce free variables; for instance, in the previous example, rc 
is free in the function 1. In this intermediate C-like code, inner functions are considered just 
like any other C block when it comes to the scope and lifespan of variables: each variable 
lives as long as the block or function which defines it, and it can be read and modified by 
any instruction inside this block, including inner functions. There is in particular no copy 
of free variables in inner functions; variables are shared inside their block. In the previous 
example, the lifespan of the variable rc is the if block, including the call to the function 1 
which reads the free variable allocated in its enclosing function while_label. 

The third example shows the importance of having explicit flow of control at the end of 
labelled blocks. After goto elimination, it becomes: 

if(rc < 0) { 

cpc_yield() ; return 10; 

cps int 1() { rc = 0; return doneO; } 

} 

cps int doneO { 

printf ("rc = "/.d\n" , rc) ; 
return rc ; 

} 

return doneO ; 

Note the tail call to done at the end of 1 to ensure that the printf is executed when rc < 0, 
and again the call on the last line to execute it otherwise. 

5.4.3 Implementation 

The actual implementation of the CPC translator implements splitting on an as-needed basis: 
a given subtree of the abstract syntax tree (AST) is only transformed if it contains CPC 
primitives that cannot be implemented in direct style. Our main concern here is to transform 
the code as little as possible, on the assumption that the gcc compiler is optimised for human- 
written code. 

To perform splitting, the CPC translator iterates over the AST, checking whether the 
cps functions are in CPS-convertible form and interleaving the two steps described above 
to reach CPS-convertible form incrementally. On each pass, when the translator finds a cps 
call it dispatches on the statement following it: 

- in the case of a return, the fragment is already CPS-convertible and the translator con- 
tinues; 

- in the case of a goto, it is converted into a tail call, with the corresponding label turned 
into an inner function (Section[5A2]l, and the translator starts another pass; 

- for any other statement, a goto is added to make the flow of control explicit, converting 
enclosing loops too if necessary (Section[5AT|i. The translator then starts another pass 
and will eventually convert the introduced goto into a tail call. 

At the end of the splitting pass, the translated program is in CPS-convertible form. How- 
ever, it is not quite ready for CPS conversion because we introduced inner functions, which 
makes it invalid C. In particular, these functions may contain free variables. The next pass, 
lambda-lifting, takes care of these free variables to get a valid C program in CPS-convertible 
form, suitable for the CPS conversion pass described in Section l5^2l 
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5.5 Lambda-lifting 

Lambda-lifting [25] is a standard technique to eliminate free variables in functional lan- 
guages. It proceeds in two phases 1111 . In the first pass ("parameter lifting"), free variables 
are replaced by local, bound variables. In the second pass ("block floating"), the resulting 
closed functions are floated to top-level. 

Example Coming back to the latest example, we add an enclosing function f to define rc 
and make the fragment self-contained: 

cps int f(int rc) { 
ifCrc < 0) { 

cpc_yield() ; return 10; 

cps int 1() { rc = 0; return done() ; } 

} 

cps int done() { 

printf("rc = °/,d\n" , rc) ; 
return rc ; 

} 

return done() ; 

y 

The function f contains two inner functions, 1 and done. The local variable rc is used as a 
free variable in both of these functions. 

Parameter lifting consists in adding the free variable as a parameter of every inner func- 
tion: 

cps int f(int rc) { 
if(rc < 0) { 

cpc_yield() ; return l(rc) ; 

cps int l(int rc) { rc = 0; return done(rc); } 

} 

cps int done (int rc) { 
printf("rc = °/.d\n" , rc) ; 
return rc ; 

} 

return done(rc) ; 

y 

Note that rc is now a parameter of 1 and done, and has been added accordingly whenever 
these functions where called. There are, now, three copies of rc; alpha-conversion makes 
this more obvious: 

cps int f(int rcl) { 
if(rcl < 0) { 

cpc_yield() ; return l(rcl); 

cps int l(int rc2) { rc2 = 0; return done(rc2); } 

} 

cps int done (int rc3) { 
printfC'rc = 7,d\n" , rc3) ; 
return rc3; 
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} 

return done(rcl); 

} 

Once the parameter lifting step has been performed, there is no free variable anymore and 
the block floating step consists in extracting these inner, closed functions at top-level: 

cps int f(int rcl) { 
if(rcl < 0) { 

cpc_yield(); return l(rcl); 

} 

return done(rcl); 

} 

cps int l(int rc2) { 
rc2 = 0; 

return done(rc2); 

} 

cps int done (int rc3) { 
printfC'rc = 7.d\n" , rc3) ; 
return rc3; 

} 

Applying boxing, splitting and lambda-lifting always yields CPS-convertible programs: 

- every call to a cps function is either a tail call (not affected by the transformations) or 
followed by a tail cps call (introduced in the splitting pass), 

- the parameters of this second cps call are local variables, since they are introduced by 
the lambda-lifting pass, 

- these parameters are not shared because they are neither global (local variables) nor 
extruded (the boxing pass ensures that there are no more extruded variables in the pro- 
gram). 

Lambda-lifting in imperative languages There is one major issue with applying lambda- 
lifting to C extended with inner functions: this transformation is in general not correct in a 
call-by-value languages with mutable variables. 

Consider what would happen if splitting were to yield the following code: 

cps int f(int rc) { 

cps void setO { rc = 0; return; } 
cps void doneO { 

printfC'rc = 7„d\n" , rc) ; 

return; 

} 

setO; return doneO; 

} 

This code, which is in CPS-convertible form, prints out rc = whatever the original value 
of rc was: the call to set sets rc to and the call to done displays it. 
Once lambda-lifted, this code becomes: 

cps int f(int rcl) { 

set (rcl); return done (rcl); 
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} 

cps void set(int rc2) { 
rc2 = 0; 
return; 

} 

cps void done(int rc3) { 
printfC'rc = "/„d\n" , rc3) ; 
return; 

} 

The result has changed: the code now displays the original value of rc passed to f rather 
than 0. The reason why lambda-lifting is incorrect in that case is because set and done 
work on two separate copies rc, rc2 and rc3, whereas in the original code there was only 
one instance of rc shared by all inner functions. 

This issue is triggered by the fact that the function set is not called in tail position. This 
non-tail call allows us to observe the incorrect value of rc3 after set has modified the copy 
rc2 and returned. If set were called in tail position instead, the fact that it operates on a 
copy of rcl would remain unnoticed. 

Lambda-lifting and tail calls In fact, although lambda-lifting is not correct in general in a 
call-by-value language with mutable variables, it becomes correct once restricted to func- 
tions called in tail position and non-extruded parameters. More precisely, in the absence of 
extruded variables, it is safe to lift a parameter provided every inner function where this pa- 
rameter is lifted is only called in tail position. We show this result in Section[6](Theorem[T] 
p. [H. 

Inner functions in CPC are the result of goto elimination during the splitting step. As 
a result, they are always called in tail position. Moreover, as explained in Section 15.31 the 
boxing pass ensures that extruded variables have been eliminated at this point of the trans- 
formation. Hence, lambda-lifting is correct in the case of CPC. 

Implementation To lift as few variables as possible, lambda-lifting is implemented incre- 
mentally. Rather than lifting every parameter, the translators looks for free variables to be 
lifted in a cps function and adds them as parameters at call points; this creates new free 
variables, and the translator iterates until it reaches a fixed point. 

This implementation might be further optimised with a liveness analysis, which would 
in particular avoid lifting uninitialised parameters. The current translator performs a very 
limited analysis: only variables which are used (hence alive) in a single function are not 
lifted. 

Experimental results The common technique to use lambda-lifting in an imperative lan- 
guage is to box every mutated variable, in order to duplicate pointers to these variables 
instead of the variables themselves. To quantify the amount of boxing avoided by our tech- 
nique of lambda-lifting tail-called functions, we used a modified version of CPC which 
blindly boxes every lifted parameter and measure the amount of boxing that it induced in 
Hekate, the most substantial program written with CPC so far. 

Hekate contains 260 local variables and function parameters, spread across 28 cps func- 
tionf[fl Among them, 125 variables are lifted. A naive lambda-lifting pass would therefore 
need to box almost 50 % of the variables. 



These numbers leave out direct-style functions, which do not need to be converted, and around 200 
unused temporary variables introduced by a single pathological macro-expansion in the curl library. 
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On the other hand, boxing extruded variables only carries a much smaller overhead: in 
Hekate, the current CPC translator boxes only 5 % of the variables in cps functions. In other 
words, 90 % of the lifted variables in Hekate are safely left unboxed, keeping the overhead 
associated with boxing to a reasonable level. 



6 Lambda-lifting in an imperative language 

To prove the correctness of lambda-lifting in an imperative, call-by-value language when 
functions are called in tail position, we do not reason directly on CPC programs, because 
the semantics of C is too broad and complex for our purposes. The CPC translator leaves 
most parts of converted programs intact, transforming only control structures and function 
calls. Therefore, we define a simple language with restricted values, expressions and terms, 
that captures the features we are most interested in (Section RTTT i. 

The reduction rules for this language (Section 16.1. lb use a simplified memory model 
without pointers and enforce that local variables are not accessed outside of their scope, as 
ensured by our boxing pass. This is necessary since we have seen that lambda-lifting is not 
correct in general in the presence of extruded variables. 

It turns out that the "naive" reduction rules defined in Section 16.1.11 do not provide 
strong enough invariants to prove this correctness theorem by induction, mostly because 
we represent memory with a store that is deeply modified by lambda-lifting. Therefore, in 
Section [6\2l we define an equivalent, "optimised" set of reduction rules which enforces more 
regular stores and closures. 

The proof of correctness is then carried in Section l63l using these optimised rules. We 
first define the invariants needed for the proof and formulate a strengthened version of the 
correctness theorem (Theorem [3] Section [6.3.1b . A comprehensive overview of the proof is 
then given in Section [633] The proof is fully detailed in Section [633] with the help of a 
few lemmas to keep the main proof shorter (Sections 16. 3.3l and l6.3.4b . 

The main limitation of this proof is that Theorems [T] and [3] are implications, not equiv- 
alences: we do not prove that if a term does not reduce, it will not reduce once lifted. For 
instance, this proof does not ensure that lambda-lifting does not break infinite loops. 



6.1 Definitions 

In this section, we define the terms (Definition^, the reduction rules (Section [6.1.1b and the 
lambda-lifting transformation itself (Section [67TT2J for our small imperative language. With 
these preliminary definitions, we are then able to characterise liftable parameters (Defini- 
tionll lb and state the main correctness theorem (Theorem[T] Section [6.1.3b . 

Definition 4 (Values, expression and terms) 

Values are either boolean and integer constants or 1, a special value for functions return- 
ing void. 

v ::= 1 | true | false | n e N 

Expressions are either values or variables. We deliberately omit arithmetic and boolean 
operators, with the sole concern of avoiding boring cases in the proofs. 



expr ::= v\x\ 



27 



Terms are made of assignments, conditionals, sequences, recursive functions definitions 
and calls. 

T ::=expr \x:=T \ if T then T else T \ T ; T 
| letrec f(x x , . . . , x n ) = T in T | f(T, ...,T) 

□ 

Our language focuses on the essential details affected by the transformations: recursive func- 
tions, conditionals and memory accesses. Loops, for instance, are ignored because they can 
be expressed in terms of recursive calls and conditional jumps — and that is, in fact, how 
the splitting pass translates them (Section [54t . Since lambda-lifting happens after the split- 
ting pass, our language need to include inner functions (although they are not part of the C 
language), but it can safely exclude goto statements. 

One important simplification of this language compared to C is the lack of pointers. 
However, remember that we are lifting only local, stack-allocated variables, and that these 
variables cannot be accessed outside of their scope, as ensured by our boxing pass. Since we 
get rid of the "address of" operator &, pointers remaining in CPC code after boxing always 
point to the heap, never to the stack. Adding a heap and pointers to our language would only 
make it larger without changing the proof of correctness. 



6.1.1 Naive reduction rules 



Environments and stores Handling inner functions requires explicit closures in the reduction 
rules. We need environments, written p, to bind variables to locations, and a store, written 
s, to bind locations to values. 

Environments and stores are partial functions, equipped with a single operator which 
extends and modifies a partial function: ■ + {■ h-> ■}. 

Definition 5 The modification (or extension) f of a partial function /, written /' = / + 
{jc i-> y}, is defined as follows: 



fit) 



y when t = x 
f(t) otherwise 
dom(/')=dom(/)U{*} 

□ 

Definition 6 (Environments of variables and functions) Environments of variables are 
defined inductively by 

p ::=£ | {x,l)-p, 
i.e. the empty domain function and p + {x h-» /} (respectively). 

Environments of functions, on the other hand, associate function names to closures: 

J?:{f,g,h,...}^{[X Xl ,...,x n .T,p,&}}. 

a 

Note that although we have a notion of locations, which correspond roughly to memory 
addresses in C, there is no way to copy, change or otherwise manipulate a location directly 
in the syntax of our language. This is on purpose, since adding this possibility would make 
lambda-lifting incorrect: it translates the fact, ensured by the boxing pass in the CPC trans- 
lator, that there are no extruded variables in the lifted terms. 
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p x = I e dom s 

(VAL) (VAR) 



(V,,) -J-> <v,i) ^ M~J-> (*'.') 

M-^+M p* = /edom/ ( flj ,)_^( v y) < M ')-^ (,/,/') 

(ASSIGN) (SEQ) = 

<.v := a,s) — (iy + {/ ^ v}) (a ; M — (/,/') 

(a,*) (true,/) (fc,/) -?->• (v,/) (a,*)-^ (false,/) (c,/) (v,/) 

(IF-T.) — — (IF-F.) — 

(if a then else c,s) — (v,/') (if a then & else c,s) — (v,s") 



(LETREC) ■ 



jT' = jr + {/ 1-> [A*i , . . . ,x n .a, p, } 



(letrec f(xj ,x n ) = a in b.s) — — > ( v i*') 

&f= [Xxi,...,x n .b,p',&'] p" = (x l J l )-...-(x„,l„) /, fresh and distinct 

Vi.{a„s,) (vi,s i+l ) (b,s n +i + {li i ^ v,}) ^' P ' > (v/) 

(CALL) — 



</( flI ,..., fl „),.v,)^(v,,') 



Fig. 7 "Naive" reduction rules 



Reduction rules We use classical big-step reduction rules for our language (Figure[7] p. 1281). 

In the (call) rule, we need to introduce fresh locations for the parameters of the called 
function. This means that we must choose locations that are not already in use, in particular 
in the environments p' and & . To express this choice, we define two ancillary functions, 
Env and Loc, to extract the environments and locations contained in the closures of a given 
environment of functions j£ ' . 

Definition 7 (Set of environments, set of locations) 

Env(^) = U{p,p'| [Xx h ...,x n .M,p,^ r ] 6lm(^),p'eEnv(^')} 

Loc(j?) = lj{lm(p) | p e Env(^)} 
A location / is said to appear in & iff / 6 Loc(j^) . 

□ 

These functions allow us to define fresh locations. 

Definition 8 (Fresh location) In the (call) rule, a location is fresh when: 

- / fj£ dom( 1 s„ + i), i.e. Z is not already used in the store before the body of / is evaluated, 
and 

- / doesn't appear in JP' + {/ h-> & f}, i-e- ' will not interfere with locations captured in 
the environment of functions. 

□ 

Note that the second condition implies in particular that / does not appear in either & or p'. 
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6.1.2 Lambda-lifting 

We mentioned in Section [531 that lambda-lifting can be split into two parts: parameter lifting 
and block floating. We will focus only on the first part here, since the second one is trivial. 
Parameter lifting consists in adding a free variable as a parameter of every inner function 
where it appears free. This step is repeated until every variable is bound in every function, 
and closed functions can safely be floated to top-level. Note that although the transformation 
is called lambda-lifting, we do not focus on a single function and try to lift all of its free 
variables; on the contrary, we define the lifting of a single free parameter x in every possible 
function. 

Usually, smart lambda-lifting algorithms strive to minimize the number of lifted vari- 
ables. Such is not our concern in this proof: parameters are lifted in every function where 
they might potentially be free. (In our implementation, the CPC translator actually uses a 
smarter approach to avoid lifting too many parameters, as explained in Section [531 ) 

Definition 9 (Parameter lifting in a term) Assume that x is defined as a parameter of a 
given function g, and that every inner function in g is called hi (for some i 6 N). Also assume 
that function parameters are unique before lambda-lifting. 

Then, the lifted form (M) t of the term M with respect to x is defined inductively as follows: 



6.1.3 Correctness condition 

We claim that parameter lifting is correct for variables defined in functions whose inner 
functions are called exclusively in tail position. We call these variables liftable parameters. 
We first define tail positions as usual 1 10 1: 

Definition 10 (Tail position) Tail positions are defined inductively as follows: 

1. M and N are in tail position in if P then M else N. 

2. N is in tail position in N and M ; N and letrec f(xi, . . . ,x„) = M in N. 



A parameter x defined in a function g is liftable if every inner function in g is called exclu- 
sively in tail position. 

Definition 11 (Liftable parameter) A parameter x is liftable in M when: 
- x is defined as the parameter of a function g, 



(1), = 1 (»), = * 
[true\ = true (false)^ = false 
M* = y an d (y '■= fl )» =y '■= (even if y = x) 

(a ; b\ = (a)„ ; (*)„ 
(if a then b else c)„ =if (a), then else (c)„ 




J letrec f(x\,. . . ,x n ,x) = (a) 4 in (b)^ if /= h, 
) letrec /(jci,...,jc„) = (a)„ in otherwise 

. . . , {a„) t ,x) if /= hi for some i 



ai) 4 ,...,(o„)J otherwise 



□ 



□ 
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- inner functions in g, named h,, are called exclusively in tail position in g or in one of the 

□ 

Our main theorem is that performing parameter-lifting on a liftable parameter preserves the 
reduction: 

Theorem 1 (Correctness of lambda-lifting) Ifx is a liftable parameter in M, then 
3t, (M,e) -j* (v,t) implies 3t', ((M)»,e) -j* (v,t') . 

Note that the resulting store t' changes because lambda-lifting introduces new variables, 
hence new locations in the store, and changes the values associated with lifted variables; 
Section [o31 is devoted to the proof of this theorem. To maintain invariants during the proof, 
we need to use an equivalent, "optimised" set of reduction rules; it is introduced in the next 
section. 



6.2 Optimised reduction rules 

The naive reduction rules (Section 16.1.1b are not well-suited to prove the correctness of 
lambda-lifting. Indeed, the proof is by induction and requires a number of invariants on the 
structure of stores and environments. Rather than having a dozen of lemmas to ensure these 
invariants during the proof of correctness, we translate them as constraints in the reduction 
rules. 

To this end, we introduce two optimisations — minimal stores (Section r6.2.1b and com- 
pact closures (Section[6jZ2]l — which lead to the definition of an optimised set of reduction 
rules (Figure[8] Section l6.2.3b . The equivalence between optimised and naive reduction rules 
is shown in the technical report [28 1. 

6.2.1 Minimal stores 

In the naive reduction rules, the store grows faster when reducing lifted terms, because each 
function call adds to the store as many locations as it has function parameters. This yields 
stores of different sizes when reducing the original and the lifted term, and that difference 
cannot be accounted for locally, at the rule level. 

Consider for instance the simplest possible case of lambda-lifting: 

letrec g(x) = (letrec h() = x in hQ) in g(l) (original) 
letrec g(x) = (letrec h(y) =y in h(x)) in g(l) (lifted) 

At the end of the reduction, the store for the original term is {l x h-> 1} whereas the store for 
the lifted term is {l x i-^ l;ly i-> 1}. More complex terms would yield even larger stores, with 
many out-of-date copies of lifted variables. 

To keep the store under control, we need to get rid of useless variables as soon as possible 
during the reduction. It is safe to remove a variable x from the store once we are certain that 
it will never be used again, i.e. as soon as the term in tail position in the function which 
defines x has been evaluated. This mechanism is analogous to the deallocation of a stack 
frame when a function returns. 
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To track the variables whose location can be safely reclaimed after the reduction of 
some term M, we introduce split environments. Split environments are written pr\p, where 
Pt is called the tail environment and p the non-tail one; only the variables belonging to the 
tail environment may be safely reclaimed. The reduction rules build environments so that a 
variable x belongs to pj if and only if the term M is in tail position in the current function 
/ and x is a parameter of /. In that case, it is safe to discard the locations associated to all 
of the parameters of /, including x, after M has been reduced because we are sure that the 
evaluation of / is completed (and there is no first-class functions in the language to keep 
references on variables beyond their scope of definition). 

We also define a cleaning operator, • \ •, to remove a set of variables from the store. 

Definition 12 (Cleaning of a store) The store s cleaned with respect to the variables in p, 
written s\p, is defined as s \p = *|dom(j)\im(p)- d 

6.2.2 Compact closures 

Another source of complexity with the naive reduction rules is the inclusion of useless vari- 
ables in closures. It is safe to remove from the environments of variables contained in clo- 
sures the variables that are also parameters of the function: when the function is called, and 
the environment restored, these variables will be hidden by the freshly instantiated parame- 
ters. 

This is typically what happens to lifted parameters: they are free variables, captured in 
the closure when the function is defined, but these captured values will never be used since 
calling the function adds fresh parameters with the same names. We introduce compact 
closures in the optimised reduction rules to avoid dealing with this hiding mechanism in the 
proof of lambda-lifting. 

A compact closure is a closure that does not capture any variable which would be hidden 
when the closure is called because of function parameters having the same name. 

Definition 13 (Compact closure and environment) A closure [Ajci,. . . ,x„.M,p,JP] is a 
compact closure if Vf,JCj ^ dom(p) and & is compact. An environment is compact if it 
contains only compact closures. □ 

We define a canonical mapping from any environment & to a compact environment J^*, 
restricting the domains of every closure in & . 

Definition 14 (Canonical compact environment) The canonical compact environment ^„ 
is the unique environment with the same domain as & such that 



V/edom(^),^/= [Ajci , . . . , 
implies f = \Xx\ 



x„.M, Pj .^'] 

^n- W 5 Pldom(p)\{.v 1 ,...,j:„}^'*] ■ 



□ 



6.2.3 Optimised reduction rules 



Combining both optimisations yields the optimised reduction rules (Figure[8] p. [32}, used in 
Section l631 for the proof of lambda-lifting. 
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Pr ■ p x = l e dom , 

(VAL) : (VAR) 



(v, s) ==> {v, s\pr) (x, s) ==* {s l,s\pr) 

(a, s ) J^U <v/) p r p* = /e dom,' <g,j) (v/) (/,,,') (v',s") 

(ASSIGN) : (SEQ) : 

( x := a,s) =^L> (1^ + {/ _> v }\pr) <fl ; M ==> (v',.v") 

(a, S ) (true,,') <v,,") (a,s) (false,,') (,,,') A (v,,") 

(IF-T.) ■ (IF-F.) 

(if a then b else c,s) ==> (v,,") (if a then b else c,,) ===*• {v,s") 

{ b,s)A( V ,s') 

(letrec) p'^Pr-pldcmtpy-pA^,...^} &'=& + {f^ [\x u ...,x n .a,p',&]} 
(letrec f(x\....,x„) = a in 2>,s) Pi ^ P ;• (v,,') 

&f= \Xxi,...,x n .b,p' ',&'] p" = (xi,h) ■ ... ■ (x n ,l n ) /, fresh and distinct 

Vi, (a;,*,) |P ^ P : • (v;,s;+i) (b,s„+i + {h >-> V;}) ======» (v,,') 

(CALL) — 

(/(ai,...,a„),si) 1 : ■ (v,,'\p r ) 



Fig. 8 Optimised reduction rules 



Consider for instance the rule (seq). 



(SEQ) ■ 



The environment of variables is split into the tail environment, pr, and the non-tail one, p. 
This means that a ; is in tail position in a function whose parameters are the variables of 
p T . When we reduce the left part of the sequence, a, we track the fact that it is not in tail 

position in this function by moving pj to the non-tail environment: (a,s) > (v,s ). On 

the other hand, when we reduce b, we are in the tail of the term and the environment stays 
split. 

As detailed above, we have introduced split environments in order to ensure minimal 
stores. Stores are kept minimal in the rules corresponding to tail positions, the leaves of the 
reduction tree: (val), (var) and (assign). In these three rules, variables that appear in the tail 
environment are cleaned from the resulting store: s\pj. 

Finally, the (letrec) and (call) rules are modified to introduce compact closures and split 
environments, respectively. Compact closures are built in the (letrec) rule by removing the 
parameters of / from the captured environment p'. In the (call) rule, environments are split 
in a tail part, which contains local variables of the called function, and a non-tail part, which 
contains captured variables; only the former must be cleaned when the tail instruction of the 
function is reduced. 



Theorem 2 (Equivalence between naive and optimised reduction rules) Optimised and 
naive reduction rules are equivalent: every reduction in one set of rules yields the same 
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result in the other. It is necessary, however, to take care of locations left in the store by the 
naive reduction: 

( M ,£)=^(v,£) iff 3s,(M,e)-U( V ,s) 
The proof of this theorem is detailed in the technical report [28|. 

6.3 Correctness of lambda-lifting 

In this section, we prove the correctness of lambda-lifting (Theorem [T] p. 130b by induction 
on the height of the optimised reduction. 

Section l6. 3 . 1 I defines stronger invariants and rewords the correctness theorem with them. 
Section [6.3.2| gives an overview of the proof. Sections [6.3.3l and l6.3.4| prove a few lemmas 
needed for the proof. Section [6.3.5l contains the actual proof of correctness. 

6.3.1 Strengthened hypotheses 

We need strong induction hypotheses to ensure that key invariants about stores and environ- 
ments hold at every step. For that purpose, we define aliasing-free environments, in which 
locations may not be referenced by more than one variable, and local positions. They yield 
a strengthened version of liftable parameters (Definition \T7\ . We then define lifted envi- 
ronments (Definition UM to mirror the effect of lambda-lifting in lifted terms captured in 
closures, and finally reformulate the correctness of lambda-lifting in Theorem [3] with hy- 
potheses strong enough to be provable directly by induction. 

Definition 15 (Aliasing) A set of environments S is aliasing-free when: 

Vp,p' e S , Vjc € dom(p),Vy e dom(p'), p x = p' y =>x = y. 

By extension, an environment of functions is aliasing-free when Env(^) is aliasing-free. 

□ 

The notion of aliasing-free environments is not an artifact of our small language, but trans- 
lates a fundamental property of the C semantics: distinct function parameters or local vari- 
ables are always bound to distinct memory locations (Section 6.2.2, paragraph 6 in ISO/IEC 
9899 (24)). 

A local position is any position in a term except inner functions. Local positions are used 
to distinguish functions defined directly in a term from deeper nested functions, because we 
need to enforce Invariant [31 (Definition 117b on the former only. 

Definition 16 (Local position) Local positions are defined inductively as follows: 

1. M is in local position in M, x := M, M ; M, if M then M else M and f(M, ...,M). 

2. N is in local position in letrec f{x\ , . . . ,x n ) =M in N. 

□ 

We extend the notion of liftable parameter (Definition[TT] p. 129b to enforce invariants on 
stores and environments. 

Definition 17 (Extended liftability) The parameter x is liftable in (M,^,pr,p) when: 
1. jc is defined as the parameter of a function g, either in M or in & , 
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2. in both M and & ', inner functions in g, named hi, are defined and called exclusively: 

(a) in tail position in g, or 

(b) in tail position in some hj (with possibly i = j), or 

(c) in tail position in M, 

3. for all / defined in local position in M, x 6 dom(p7- ■ p) 3i, / = hi, 

4. moreover, if hi is called in tail position in M, then x g dom(p7-), 

5. in & ', x appears necessarily and exclusively in the environments of the /?,'s closures, 

6. & contains only compact closures and Env(j?) U {p,pr} is aliasing-free. 

□ 

We also extend the definition of lambda-lifting (Definition [9] p.|29l> to environments, in 
order to reflect changes in lambda-lifted parameters captured in closures. 

Definition 18 (Lifted form of an environment) 

W^f=[Xx u ...,x n .b,p',3?'] then 

[lx\, . ..,X„X. (fc)»,p'|dom(p')\W> when / = h for some 1 



[Xx\ ,...,x„.(b)„p', (c^"')*] otherwise 



Lifted environments are defined such that a liftable parameter never appears in them. This 
property will be useful during the proof of correctness. 

Lemma 1 If x is a liftable parameter in (M ,pj,p), then x does not appear in {&\. 

Proof Since x is liftable in (M,^,pT,p), it appears exclusively in the environments of hj. 
By definition, it is removed when building □ 

These invariants and definitions lead to an enhanced correctness theorem. 

Theorem 3 (Correctness of lambda-lifting) If x is a liftable parameter in (M,JP,pT,p), 
then 

(M,s) =^ (v,s>) implies <(*)„,) =^=> (v,s') 

Since naive and optimised reductions rules are equivalent (Theorem[2] p. [32}, the proof of 
Theorem[T1(p.l30b is a direct corollary of this theorem. 

Corollary 1 If x is a liftable parameter in M, then 

3t, (M,e) (v,t) implies 3t' , ((M) t ,e) — ^ (v,f') . 



6.3.2 Overview of the proof 



With the enhanced liftability definition, we have strong enough invariants to perform a proof 
by induction of the correctness theorem. This proof is detailed in Section l6.3.5l 

The proof is not by structural induction but by induction on the height of the derivation. 
This is necessary because, even with the stronger invariants, we cannot apply the induction 
hypotheses directly to the premises in the case of the (call) rule: we have to change the 
stores and environments, which means rewriting the whole derivation tree, before using the 
induction hypotheses. 
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For that reason, the most important and difficult case of the proof is the (call) rule. We 
split it into two cases: calling one of the lifted functions (/ = hi) and calling another function 
(either g, where x is defined, or any other function outside of g). Only the former requires 
rewriting; the latter follows directly from the induction hypotheses. 

In the (call) rule with / = hi, issues arise when reducing the body b of the lifted function. 
During this reduction, indeed, the store contains a new location /' bound by the environment 
to the lifted variable x, but also contains the location I which contains the original value of 
x. Our goal is to show that the reduction of b implies the reduction of (&)„, with store and 
environments fulfilling the constraints of the (call) rule. 

To obtain the reduction of the lifted body we modify the reduction of b in a series 
of steps, using several lemmas: 

- the location / of the free variable x is moved to the tail environment (Lemma[2]l; 

- the resulting reduction meets the induction hypotheses, which we apply to obtain the 
reduction of the lifted body (b) t ; 

- however, this reduction does not meet the constraints of the optimised reduction rules 
because the location / is not fresh: we rename it to a fresh location /' to hold the lifted 
variable; 

- finally, since we renamed / to /', we need to reintroduce a location I to hold the original 
value of x (Lemmas [3] and @J. 

The rewriting lemmas used in the (call) case are shown in Section l6.3.3l 

For every other case, the proof consists in checking thoroughly that the induction hy- 
potheses apply, in particular that x is liftable in the premises. It consists in checking invari- 
ants of the extended liftability definition (Definitionll7b. To keep the main proof as compact 
as possible, the most difficult cases of liftability, related to aliasing, are proven in some 
preliminary lemmas (Section r6.3.4l >. 

One last issue arises during the induction when one of the premises does not contain the 
lifted variable x. In that case, the invariants do not hold, since they assume the presence of 
x. But it turns out that in this very case, the lifting function is the identity (since there is no 
variable to lift) and lambda-lifting is trivially correct. 

6.3.3 Rewriting lemmas 

Calling a lifted function has an impact on the resulting store: new locations are introduced 
for the lifted parameters and the earlier locations, which are not modified anymore, are 
hidden. Because of these changes, the induction hypotheses do not apply directly in the 
case of the (call) rule for a lifted function hi. We use the following three lemmas to obtain, 
through several rewriting steps, a reduction of lifted terms meeting the induction hypotheses. 

- Lemma [2] shows that moving a variable from the non-tail environment p to the tail 
environment pj does not change the result, but restricts the domain of the store. It is 
used transform the original free variable x (in the non-tail environment) to its lifted copy 
(which is a parameter of hi, hence in the tail environment). 

- Lemmas [3] and [4] add into the store and the environment a fresh location, bound to an 
arbitrary value. It is used to reintroduce the location containing the original value of x, 
after it has been alpha-converted to /'. 

Lemma 2 (Switching to tail environment) If {M,s} PT ^ X '^ p =^ (y,s') and x £ dom(p;r) 

then (M.s) PT = ' ^ ==>- ^v, ^'|dom(.s-')\{/})- Moreover, both derivations have the same height. 
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Proof By induction on the structure of the derivation. For the (val), (var), (assign) and (call) 
cases, we use the fact that s\pr ■ (x, I) = >?'|dom(s')\{/} when s' = s \ pj. □ 

Lemma 3 (Spurious location in store) If (M,s) P ' ^ P > (v,s') and k does not appear in 

either s, & or pj ■ p, then, for all value u, (M,s + {k i— > u}} Pl ^ P > (v,s' + {k i— > u}). More- 
over, both derivations have the same height. 

Proof By induction on the height of the derivation. The key idea is to add (k, u) to every 
store in the derivation tree. A collision might occur in the (call) rule, if there is some j 
such that lj = k. In that case, we need to rename lj to some fresh variable l'j ^ k (by alpha- 
conversion) before applying the induction hypotheses. □ 

Lemma 4 (Spurious variable in environments) 

V/, / , (M, s) > {v, s ) iff (M, s) => ( v, s ) 

Moreover, both derivations have the same height. 

Proof By induction on the structure of the derivation. The proof relies solely on the fact that 
pT-(x,l)-p=p T -(x,l)-(x,l')-p. □ 

6.3.4 Aliasing lemmas 

We need two lemmas to show that environments remain aliasing-free during the proof by 
induction in Section l6.3.5l They are purely technical lemmas that consist in proving that the 
aliasing invariant (Invariant [6] Definition 1 17b holds in the context of the (call) and (letrec) 
rules, respectively. We only show the (call) lemma here; the (letrec) lemma is very similar 
and detailed in the technical report 1281 . 

Lemma 5 (Aliasing in (call) rule) Assume that, in a (call) rule, 

— &f=[Xx h ...,x n .b,p',&'}, 

— Env(j£") is aliasing-free, and 

— p" = (x\, h) • . . . ■ (x n ,l n ), with fresh and distinct locations l,. 
Then Env(^' + {/ H- & f}) U {p',p"} is also aliasing-free. 

Proof Let S = Env( J?' + {/ h-> J?/}) U {p'}. We know that S C Env(^) so S is aliasing- 
free We want to show that adding fresh and distinct locations from p" preserves this lack of 
aliasing. More precisely, we want to show that 

Vpi,p 2 S <?U {p"},Vx 6 dom(pi), Vy G dom(p 2 ), pi x = p 2 y =>• x = y 

given that 

Vpi,p2 G S^x G dom(pi),Vy G dom(p 2 ), pi x = p 2 y x = y. 

We reason by checking of all cases. If p\ G S and p 2 G S, immediate. If pi = p 2 = p" then 
p" x = p" y x = y holds because the locations of p" are distinct. If pi = p" and p 2 G S 
then pi^ = p 2 y=>x = y holds because p\ x p 2 y (by freshness hypothesis). □ 



37 



6.3.5 Proof of correctness 

We finally recall and show Theorem[3l(p.l34l>. 

Theorem 3 (Correctness of lambda-lifting) If x is a liftable parameter in (M,JP,pT,p), 
then 

(M,s) (v/) implies {(M) t ,s) =M=> (v,s') 

Assume that.v is a liftable parameter in (M, J? ,pr,p). The proof is by induction on the 

Pt \p 

height of the reduction of (M, s) > (v,s'). We only show the case (call). The full proof 

is available in the technical report 1 28 1 . 

(call) — first case First, we consider the most interesting case where there exists i such 
that / = hi. The variable x is a liftable parameter in (hi(ai,...,a n ),^,pT,p) hence in 
(a h ^,e,p T -p) too. 

By the induction hypotheses, we get 

((«;)*> s i) ' P ^' P > (vi,Si+i)- 

By the definition of lifting, (/i/(ai,. . = ftf((ai) % , . . . , {a n ) t ,x). But jc is not a liftable 

parameter in (£>, J?' ,p" ,p') since the Invariant [4] might be broken: jc ^ dom(p") (x is not a 
parameter of hi) but h j might appear in tail position in b. 

On the other hand, we have x 6 dom(p'): since, by hypothesis, x is a liftable parameter 
in (hiifli, . . . ,a n ),^,pT,p), it appears necessarily in the environments of the closures of the 
hi, such as p'. This allows us to split p' into two parts: p' = (x, I) ■ p'". It is then possible to 
move (x, I) to the tail environment, according to Lemma|2j 

{b,S„ +l + {k ^ V|>) ===== > {V,S |dom( S ')\{/}) 

This rewriting ensures that x is a liftable parameter in J?' + {/ h-> & f},p" • (x,l),p m ) 
(by Lemma[5]for the Invariant 
By the induction hypotheses, 

((*)»,. ?„+i +{h ^ V,}) =^====^==> (v,/|dom( S ')\{0) 

The / location is not fresh: it must be rewritten into a fresh location, since x is now a param- 
eter of hi. Let I' be a location appearing in neither + {/ i— > &f})*, nor s n+ \ + {/,■ !->• v,} 
or p" • Pj'. Then /' is a fresh location, which is to act as / in the reduction of (fc)„. 

We will show that, after the reduction, I' is not in the store (just like / before the lambda- 
lifting). In the meantime, the value associated to I does not change (since I' is modified 
instead of I). 

Lemma [T] implies that x does not appear in the environments of so it does not 

appear in the environments of ( j^"' + {/ h-> C either. As a consequence, lack 

of aliasing implies by Definition [15] that the label /, associated to x, does not appear in 
(J* 7 + ^/})« either, so 

(jr< + {/ _> ^/}) J/'/7] = (JF' + {/ 1 ^ . 
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Moreover, I does not appear in s'|dom(i')\{/}- Since V does not appear in the store or the 
environments of the reduction, we rename I to I': 

((b)*,s n+1 [l'/l]+{k^Vi}) ==^M^=^ (v,s'| dom(j , n{ , } ). 

We want now to reintroduce I. Let v x = s„ + \ I. The location / does not appear in s n+ i [/'/ 1] + 
{U h> Vi}, + {/ <-> J*"/}),, or p"(x, I') ■ p'". Thus, by LemmaH 

p"(x l')\p"' 

((*)., j„+i [/'/'] + {/< ^ v,-} + {/ m. v,}) , " F =» (V| dom(s0 y W + {/ ^ v,}) . 

Since 

■sw+l + {k i-> v;} + {7 >->• v,} = s n+ i [Z'/Z] +{/•-> v A } + {/,• ^ v,} because V;', / / /,■ 

= + {/' i — y Vx] + {/,- 1 — y v,} because v x = 

= s„_|_i + {/,■ i-> v, } + {/' h-^ Vjc} because Vi, I' ^ h 

and 5'|dom(.s')\{/} + U ^ v v} = > s ' + v A }, we finish the rewriting by Lemma[4] 

((b) M + {k i y v,} + {/' i y v,}> gff^gg > (v, s' + {l^ V. t }> . 
Hence the result: 

(JQ, ft;= [lxi,...,j W c.(A)»,p , ,(^OJ 
p = (jci, h) • . . . ■ (x n ,l n ) (x, Pt x) /' and /,■ fresh and distinct 

Vi, ((«;)*> si) |PJ P > (vi,s i+ i) ((*) ,j b+ i) |P7 ' P > (v x ,s„+i) 
<(*)^ n+ , + {/,- ^ v,.} + {/' h+ v,}) : • (v, s' + [l » v x }) 

(CALL) — 

((^•(ai,...,^)),,*!) == > (v,s' + {l t-^v x }\p T ) 

Since i 6 dom(p7-) (because x is a liftable parameter in (/j,(ai, . . . ,a„), J?,p7,p)), the ex- 
traneous location is reclaimed as expected: s' + {/ ^> v*} \pr = s '\pT- 



(call) — second case We now consider the case where / is not one of the h-,. The variable jc 
is a liftable parameter in . . . ,a„), J^,p;r,p) hence in (a,-, ^,e,pr ■ p) too. 

By the induction hypotheses, we get 



({ad*, St) 




and, by Definition [9] 

(/(ai,...,a„)) 4 =/((ai)*,...,(a„)J. 

If x is not defined in or ^\ then ()„ is the identity function and can trivially be applied 
to the reduction of b. Otherwise, x is a liftable parameter in (b,^ 1 + {/ h-> j?/},p",p') 
— when checking the invariants of Definition [T7] we use Lemma [5] for the Invariant [6] and 
check separately / = g and / ^ g for the Invariantsf3]and|4](see the technical report [28] for 
more details). 
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By the induction hypotheses, 



hence: 



/= [kx 1 ,...,x n .(b) it ,p , ,(^%] 
p" = (xi, h) ■ . . . • (x n ,l„) k fresh and distinct 

Vi, ((fli)*,*/) ====> (vi,S i+ i) ((fc)»,*n+l + {h ^ v ; }> ===^====^ (v,/) 

p~]p 

((f(a u ...,a n )) lt ,si} T > (v,s'\p T ) 

(jr)„ 

Other cases The detailed proof of the other cases is available in the technical report [28 1. 
They are mostly straightforward by induction, except (letrec) which requires an additional 
aliasing lemma to check the liftability invariants. □ 



7 Conclusions and further work 

In this paper, we have described CPC, a programming language that provides threads which 
are implemented, in different parts of the program, either as extremely lightweight heap- 
allocated data structures, or as native operating system threads. The compilation technique 
used by CPC is somewhat unusual, since it involves a continuation-passing style (CPS) 
transform for the C programming language. We have shown the correctness of that particular 
instance of the CPS transform, as well as the correctness of CPC's compilation scheme; 
while other efficient systems for programming concurrent systems exist, we claim that the 
existence of a proof of correctness makes CPC unique among them. 

CPC is highly adapted to writing high-performance network servers. To prove this fact, 
we have written Hekate, a large scale BitTorrent seeder in CPC. Hekate has turned out to be 
a compact, maintainable, fast and reliable piece of software. 

We enjoyed writing Hekate very much. Due to the lightweight threads that it provides, 
and due to the determinacy of scheduling of attached threads, CPC threads have a very 
different feel from threads in other programming languages; discovering the right idioms 
and the right abstractions for CPC has been (and remains) one of the most enjoyable parts 
of our work. 

For CPC to become a useful production language it must come equipped with a consis- 
tent and powerful standard library that encapsulates useful programming idioms in a gener- 
ally usable form. The current CPC library has been written on an on-demand basis, mainly 
to meet the needs of Hekate; the choice of functions that it provides is therefore somewhat 
random. Filling in the holes of the library should be a fairly straightforward job. For CPC to 
scale easily to multiple cores, this extended standard library should also offer the ability to 
run several event loops, scheduled on different cores, and migrate threads between them. 

We have no doubt that CPC can be useful for applications other than high-performance 
network servers. One could for example envision a GUI system where every button is imple- 
mented using three CPC threads: one that waits for mouse clicks, one that draws the button, 
and one that coordinates with the rest of the system. To be useful in practice, such a system 
should be implemented using a standard widget library; the fact that CPC integrates well 
with external event loops indicates that this should be possible. 
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Finally, the ideas used in CPC might be applicable to programming languages other than 
C. For example, a continuation passing transform might be a way of extending Javascript 
with threads without the need to deploy a new Javascript runtime to hundreds of millions of 
web browsers. 



Software availability 

The full CPC translator, including sources and benchmarking code, is available online at 
http : //www .pps .univ-paris-diderot . f r/~kerneis/ sof tware/cpc/ 

The sources of Hekate are available online at http: //www. pps. univ-paris-diderot .f r/~kerneis/sof tware/hekate/ 
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